…y tracking (#433, #434, #435)

Implements three follow-up features from #431 execution status classification:

- --retry-errors <jsonl>: re-run only execution_error test cases from a previous output
- execution.fail_on_error config: true (halt on first), false (never halt), or 0.0-1.0 threshold
- errorRetries field on EvaluationResult to track transient errors retried during provider invocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval-schema-sync test requires the Zod schema to be the source of truth. Adds FailOnErrorSchema to ExecutionSchema and regenerates the JSON schema to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite threshold test to exercise actual ratio math (succeed → succeed → fail → fail → fail triggers halt at 3/5=0.60 > 0.5)
- Fix docs range notation from 0.0-1.0 to >0.0-1.0 (exclusive of 0)
- Add concurrency best-effort note to docs
- Add comment explaining why 0 is excluded from numeric thresholds
- Add lightweight validation (testId + score) in loadNonErrorResults

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
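The ratio math that the rewritten threshold test exercises can be sketched as follows. This is a minimal illustration, not the PR's orchestrator code: `FailOnError`, `shouldHalt`, and `haltPoint` are hypothetical names, and the strict `>` comparison is an assumption inferred from the "3/5=0.60 > 0.5" wording in the commit message.

```typescript
// Hypothetical sketch: these names and shapes are assumptions for illustration,
// not agentv's actual orchestrator code.

// true = halt on first error, false = never halt,
// number = error-ratio threshold in (0, 1].
type FailOnError = boolean | number;

// Decide whether to halt after `completed` test cases, of which `errors`
// ended in an execution error.
function shouldHalt(failOnError: FailOnError, errors: number, completed: number): boolean {
  if (typeof failOnError === "boolean") {
    return failOnError && errors > 0;
  }
  if (completed === 0) return false;
  // Numeric threshold: halt only once the error ratio strictly exceeds it.
  return errors / completed > failOnError;
}

// Walk outcomes in order (true = executed, false = execution error) and return
// how many cases had completed when the run halted, or null if it never halted.
function haltPoint(outcomes: readonly boolean[], failOnError: FailOnError): number | null {
  let errors = 0;
  for (let i = 0; i < outcomes.length; i++) {
    if (!outcomes[i]) errors += 1;
    if (shouldHalt(failOnError, errors, i + 1)) return i + 1;
  }
  return null;
}
```

For the sequence in the commit message, `haltPoint([true, true, false, false, false], 0.5)` returns 5 under these assumptions: the ratio is 2/4 = 0.5 after the fourth case, which does not exceed the threshold, and 3/5 = 0.6 after the fifth, which does. This sequential walk ignores concurrency, which the commit notes makes halting best-effort.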
Deploying agentv with Cloudflare Pages

| Latest commit: | 4ea4508 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://42e50489.agentv.pages.dev |
| Branch Preview URL: | https://feat-follow-up-431-error-tra.agentv.pages.dev |
…old)

Align with industry standards (promptfoo, braintrust) by keeping fail_on_error as a simple true/false toggle. The numeric ratio threshold (0.0-1.0) was YAGNI: post-hoc analysis of JSONL output is sufficient for error ratio decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
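With the numeric threshold dropped, reading the option reduces to a boolean check. The sketch below is an assumption-laden illustration rather than the PR's parser: `parseFailOnError` is a hypothetical name, the `false` default is taken from the PR summary, and rejecting non-boolean values with an error is a guess at reasonable behavior.

```typescript
// Hypothetical sketch: not agentv's actual config parser.
// Accepts only true/false for execution.fail_on_error.
function parseFailOnError(raw: unknown): boolean {
  // Assumed default: never halt when the option is omitted.
  if (raw === undefined) return false;
  if (typeof raw === "boolean") return raw;
  // Assumption: anything else (including a former ratio such as 0.5) is rejected.
  throw new TypeError(
    `execution.fail_on_error must be true or false, got ${JSON.stringify(raw)}`,
  );
}
```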
…e to CLAUDE.md

Remove ErrorRetry interface, errorRetries field on EvaluationResult, and retry tracking code: no industry precedent, and retry count can be added later if needed. Add YAGNI as design principle #4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Implements three follow-up features from #431 (execution status classification):
- `--retry-errors <jsonl>` (Add --retry-errors CLI flag to re-run only execution errors #433): Re-run only `execution_error` test cases from a previous output. Non-error results are preserved and merged into the new output.
- `execution.fail_on_error` (Add fail_on_error tolerance config for eval runs #434): Configurable error tolerance: `true` (halt on first error), `false` (never halt, default), or `0.0–1.0` threshold ratio.
- `errorRetries` (Track retried transient errors in eval results for diagnostics #435): Track transient errors (timeouts) that were retried during provider invocation, attached to the final `EvaluationResult`.

Closes #433, closes #434, closes #435
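The `--retry-errors` flow described above (pick out the `execution_error` cases, keep everything else, merge after the re-run) might look roughly like the sketch below. All names here are hypothetical: the `executionStatus` field name, the `"ok"` status value, and the choice to silently skip malformed lines are assumptions, since the PR text only names the `execution_error` status and a lightweight `testId` + `score` check.

```typescript
// Hypothetical sketch: field names and helpers are assumptions,
// not agentv's actual JSONL schema or loader functions.
interface PreviousResult {
  testId: string;
  score: number;
  executionStatus?: string; // assumed field name for the execution status
}

// Parse JSONL text into raw values, ignoring blank lines.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

// Lightweight validation: only testId and score are checked.
function isResult(value: unknown): value is PreviousResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return typeof r.testId === "string" && typeof r.score === "number";
}

// Split a previous output into test IDs to re-run and results to preserve.
function splitPrevious(text: string): { errorIds: string[]; kept: PreviousResult[] } {
  const errorIds: string[] = [];
  const kept: PreviousResult[] = [];
  for (const value of parseJsonl(text)) {
    if (!isResult(value)) continue; // assumption: malformed lines are skipped
    if (value.executionStatus === "execution_error") errorIds.push(value.testId);
    else kept.push(value);
  }
  return { errorIds, kept };
}

// Merge preserved results with the fresh results from the re-run.
function mergeOutputs(kept: PreviousResult[], rerun: PreviousResult[]): PreviousResult[] {
  return [...kept, ...rerun];
}
```

A caller would pass `errorIds` to the runner as a filter and write `mergeOutputs(kept, rerunResults)` as the new output. Placing the preserved results before the re-run ones is an arbitrary ordering choice in this sketch.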
Test plan
- `loadErrorTestIds` / `loadNonErrorResults` (5 tests)
- `extractFailOnError` config parser (9 tests)
- `fail_on_error` orchestrator behavior (3 tests: true, threshold, false)
- `errorRetries` tracking (2 tests: with/without retries)

🤖 Generated with Claude Code